Abstract

Background: Hepatitis B virus (HBV) is an important infectious agent that causes widespread concern because 
billions of people are infected by at least 8 different HBV genotypes worldwide. However, reconstruction of the 
phylogenetic relationship between HBV genotypes is difficult. Specifically, the phylogenetic relationships among 
genotypes A, B, and C are not clear from previous studies because of the confounding effects of genotype 
recombination. In order to clarify the evolutionary relationships, a rigorous approach is required that can effectively 
explore genetic sequences with recombination. 

Results: In the present study, the phylogenetic relationship of the HBV genotypes was reconstructed using a consensus
phylogeny of phylogenetic trees of HBV genome segments. The reliability of the reconstructed phylogeny was
extensively evaluated by its agreement with the local phylogenies of genome segments.

The reconstructed phylogenetic tree revealed that HBV genotypes B and C had a closer phylogenetic relationship
than genotypes A and B or A and C. Evaluations showed that the consensus method was capable of reconstructing
reliable phylogenetic relationships in the presence of recombinants.

Conclusion: The consensus method implemented in this study provides an alternative approach for reconstructing 
reliable phylogenetic relationships for viruses with possible genetic recombination. Our approach revealed the 
phylogenetic relationships of genotypes A, B, and C of HBV. 
Background

Hepatitis B virus (HBV), a serious global public health 
problem, is the 10th leading cause of death worldwide. 
Approximately 2 billion people worldwide are infected 
with this virus and about 350 million live with chronic 
infection. An estimated 600,000 people die each year 
due to acute or chronic consequences of hepatitis B [1]. 

There are eight well-recognized HBV genotypes, labeled 
A through H, each pair of which differs by at least 8% of 
the complete genome sequence. The distribution of the genotypes
varies across geographic regions with population
migration [2,3]. Type A is located mostly in Europe, South
Africa, and North America; types B and C are prevalent in 
East Asia, Southeast Asia, and Oceania; type D is common 
in South Asia, the Mediterranean area, and the Middle 
East; type E is predominant in sub-Saharan Africa; types F, 
G, and H are common in the New World and are also 
found in some European countries, such as France and 
Germany. Within the 8 genotypes, HBV can be further di- 
vided into different subtypes that differ by 4% to 8% of the 
genome [3]. In addition to the 8 well-known genotypes, two
putative genotypes, I and J, could not be classified
into the groups above [4,5].

Several studies have reported controversial phylogenetic 
relationships among HBV genotypes, especially genotypes 
A, B, and C. Three reports suggest that genotypes A and 
C have a closer phylogenetic relationship than genotype B 
with A or C [4,6,7]. The above phylogenetic relationship 
has been brought into question, however, by the results of 
other studies demonstrating that genotypes B and C have
a closer phylogenetic relationship than genotype A with B 
or C [8-10]. One study also reported that the phylogenetic 
relationship between genotypes A and B is much closer 
than that of genotype C with A or B [3]. Further, three 
other studies were unable to elucidate the relationship of 
the genotypes in detail and suggested that the three geno- 
types were on the same phylogenetic clade [11-13]. The 
ambiguity of the phylogenetic relationship of the HBV 
genotypes is thought to be due in part to historical recombination
in the HBV genome [8,9,14]. Recent efforts
to detect recombinants in the HBV genome have
provided a comprehensive picture of the
distribution of recombination in the HBV genome [14-16].

In order to reduce the confounding effects of recombi- 
nation in the process of phylogeny reconstruction, Fares 
and Holmes (2001) utilized gene non-overlapping regions 
of the HBV genome to reconstruct the phylogeny, but the 
reconstructed phylogeny from their study was not consis- 
tent with the geographic prevalence of the genotypes; i.e., 
genotypes B and C were distributed geographically closer 
while they were more distant in their reconstructed phylogenetic
relationship [3,6]. It may therefore be necessary
to incorporate the whole-genome information of HBV,
and an approach that does not account for recombination
is highly unlikely to resolve the ambiguity of the
phylogenetic relationship of HBV genotypes. This ambiguity
gave us an opportunity to propose
and validate effective phylogenetic methods for exploring
genetic sequences with recombination.

Here, we reconstructed the phylogenetic relationship of 
HBV genotypes using a consensus-tree approach to inte- 
grate whole-genome information. The overall phylogeny 
indicated that HBV genotypes B and C have a closer phylogenetic
relationship than genotype A with B or C. Multi-level
evaluations indicated that the reconstructed phylogenetic
tree of HBV genotypes was reliable from several perspectives.
We regard this report not solely as a clarification of
HBV phylogenies but also as a communication of the
implemented methods. The methods implemented in this
study could be an alternative choice for phylogeny reconstruction
in the presence of recombinants.

Results 

Consensus relationship of local phylogenies 

The phylogenetic relationship can be represented as a
phylogenetic network with reticulations when recombination
occurs among sequences. For three sequences
with a known root, the phylogenetic relationship can be
shown as a rooted triplet with reticulations (Figure 1A; a
four-taxon quartet in which one taxon is the given outgroup
is called a rooted triplet). In this scenario,
setting homoplasy aside, formation of the reticulation can
generally be explained as a consequence of recombination
between sequences Seq1 and Seq3 when a recombination
event is highly possible [17,18]. In the presence of recombination,
sequence Seq2 could be considered a mosaic
of Seq1 and Seq3 following the law of parsimony, i.e.,
Occam's razor. We defined the major phylogenetic
relationship (shown as a rooted triplet without reticulation,
Figure 1B) of the three involved sequences as the
topological relationship presented by the majority of
phylogenetic trees of their aligned sequence segments. In
the major phylogenetic relationship, the ancestor of the 
mosaic is the ancestral sequence that contributed the most 
genetic content (80% in Figure 1) to the mosaic compared 
with the other sequences. When a pool of the major 
rooted triplets is available to present major phylogenetic 
relationships of all possible three-sequence combinations 
for multiple sequences, a consensus tree of the major 
rooted triplets could present the major phylogenetic rela- 
tionship of all of the involved sequences. 
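
The definition of the major rooted triplet can be illustrated with a short sketch. It is a hypothetical illustration rather than the code used in this study: it assumes that, for one fixed set of three sequences, each window's local tree has already been reduced to the pair of sequences that cluster together, and it simply returns the pair seen in the most windows. The function name `major_triplet` and the input encoding are assumptions made for the example.

```python
from collections import Counter

def major_triplet(cherries):
    """Return the most frequent clustering pair and its share of windows.

    `cherries` holds one entry per window: the pair of sequence names
    that cluster together in that window's rooted triplet, or None
    when the window is unresolved for these three sequences.
    """
    counts = Counter(frozenset(c) for c in cherries if c is not None)
    if not counts:
        return None, 0.0
    # Ties are broken by first occurrence; the source does not specify a rule.
    cherry, n = counts.most_common(1)[0]
    return tuple(sorted(cherry)), n / sum(counts.values())

# Figure 1 scenario: 80% of windows group Seq1 with Seq2,
# 20% group Seq2 with Seq3.
windows = [("Seq1", "Seq2")] * 8 + [("Seq2", "Seq3")] * 2
print(major_triplet(windows))  # (('Seq1', 'Seq2'), 0.8)
```

In this toy case the minor relationship between Seq2 and Seq3 is outvoted, which mirrors the removal of the reticulation in Figure 1B.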

Tree-like phylogeny of HBV 

In the present study, the consensus phylogenetic rela- 
tionship of the involved HBV sequences was constructed 
using the majority consensus of local phylogenies of all 
genome segments (see Methods for details). We named 
the phylogenetic relationship of a genome segment as 
the local phylogeny. When the size of all genome seg- 
ments was 250 base pairs (bp), the consensus phylogen- 
etic relationship of HBV genotypes was ambiguous such 
that genotypes A, B, and C appeared in the same clade 
of the consensus tree forming a trifurcation (Figure 2A). 
When the segment size was increased to 500 bp, 750 bp, 
1000 bp, 1250 bp, or 1500 bp, however, the consensus 
topological relationship of the HBV genotypes was the 
same (Figure 2B). In these analyses, the B and C geno- 
types had a closer phylogenetic relationship than that of 
genotype A with B or C. The close phylogenetic relationship
between genotypes B and C was strongly supported
by bootstrapping evaluation (0.99, 1000 bootstrap
replicates). Notably, the close relationship between genotypes B
and C was also supported by the worldwide geographic
prevalence of the HBV genotypes and the fact that both
genotypes are prevalent in East Asia [3].

Reliability of the consensus phylogenetic relationship 

A good consensus phylogenetic tree should represent 
the majority of phylogenetic relationships of different 
segments of the HBV genome for all involved sequences. 
To gain a thorough understanding of the reliability of 
our results, we evaluated the constructed consensus 
phylogenetic trees at both the tree and branch levels. 

At the tree level, we checked the consistencies between
the constructed consensus trees and local phylogenies of
sequence segments (see Methods for details). Our results
indicated that the consensus trees were well-supported by
the local phylogenies of sequence segments located at different
coordinates (Figure 3, Additional file 1: Figure S1).
The mean consistencies of different segment sizes ranged
from 0.68 to 0.75 with standard deviations ranging from
0.02 to 0.05. More specifically, the mean ± standard deviation
of the consistencies was 0.68 ± 0.05, 0.74 ± 0.05,
0.74 ± 0.04, 0.74 ± 0.02, 0.75 ± 0.03, and 0.72 ± 0.02 for
segment sizes 250 bp, 500 bp, 750 bp, 1000 bp, 1250 bp,
and 1500 bp, respectively. Further, the consistencies were
sensitive to the size of the sequence segments, but there
was no significant difference among different genome
regions. When the segment size increased, the difference
in the consistencies of different segments decreased
(Additional file 1: Figure S1).

Figure 1 Identifying the major phylogenetic relationship from a phylogeny with reticulation. A. Seq2 is a mosaic sequence in which
most of its components (80%) are descendants of sequence Seq1 and the remaining components (20%) are descendants of sequence Seq3.
B. A major phylogenetic relationship can be achieved by removing the minor relationship between Seq2 and Seq3. '*' indicates a
truncated sequence.

At the branch level, the reliability of each internal 
branch of the consensus phylogenetic trees was evaluated 
based on the agreement of local phylogenies with the 
specific branch (see Methods for details). The branches of 
the consensus phylogenetic trees were highly reliable.
Agreements of the intra-genotype branches were generally
greater than 0.90 and their 95% confidence intervals (CI)
were very narrow in the bootstrapping evaluation (1000
bootstrap replicates, see Methods for details; Figure 4,
Additional file 1: Figure S2). The high reliabilities at the
branch level suggest that intra-genotype recombination
has a limited impact on our reconstructed phylogenetic
relationship. Reliabilities of inter-genotype branches were
generally high (with agreements over 0.90), except for two
branches (Figure 4, Additional file 1: Figure S2). One of
the branches split genotypes B and C from the other geno- 
types and the other branch split genotypes A, B, and C 
from genotypes D and E. For example, when the segment 
size was 500 bp, the cluster of genotypes B and C had a 
relatively lower reliability of agreement (0.75 with 95% CI 
0.74 - 0.76, Figure 4). In the same scenario of a 500-bp 
segment size, even the branch with the poorest reliability, 
which splits genotypes A, B, and C from the others, had 
agreement of 0.65 with 95% CI 0.63-0.67. Therefore, all 
splits of the reconstructed phylogenetic relationship of the 
HBV genotypes were well-supported by the majority of the 
local phylogenies (Figure 4, Additional file 1: Figure S2). 

Further demonstration for advantage of the consensus 
method 

The maximum likelihood (ML) method is the most popular
and comprehensive approach in studies of genetic
phylogeny [19], including studies of HBV evolution
[6-8,12,20]. The ML method builds its inference on robust statistical
models and searches tree space for the solution
with the maximum likelihood value. Therefore, from many
perspectives, the ML method performs excellently in phylogeny
reconstruction [19]. To demonstrate the advantage of
our consensus method in the presence of recombination,
we applied both our method and the ML method to HBV
sequences mixed with simulated genotype A/C recombinants
(see Methods for details). Using datasets with a
moderate recombinant frequency (f = 0.14), the ML
method reconstructed an incorrect phylogenetic relationship
in which genotypes A and C were wrongly clustered
together (Additional file 1: Figure S3). By contrast, using
the same synthetic datasets, our consensus method
reconstructed a phylogenetic relationship with the correct
topological pattern (Additional file 1: Figure S4). It is
worth mentioning that both methods produced correct
phylogenies if the recombinants were rare in the
simulated datasets. Furthermore, both methods failed
to reconstruct the correct phylogeny when the frequency of
recombinants was very high, for example f = 0.60.

Figure 2 Schematic presentation of the phylogenetic relationship of HBV genotypes A, B, and C. A. The phylogenetic relationship of the
three genotypes is ambiguous when the analysis window is only 250 bp in size. B. Genotypes B and C show a closer phylogenetic
relationship when the analysis window size is at least 500 bp.

Figure 3 Consistency between the consensus phylogenetic trees and corresponding local phylogenies along the HBV genome.
Consistency was measured as a percentage of the agreement between local phylogenies of different segment sizes and the corresponding
consensus tree. The percentage is shown on the y-axis and the x-axis shows the coordinates of local phylogenies along the aligned HBV
sequences. The dashed line indicates the 50% agreement.




Figure 4 Reliability of internal branches of the consensus phylogenetic tree. The reliability of each internal branch is marked in red on each
internal branch. Only the consensus phylogenetic tree from analyzing a 500-bp window is presented. More results for other window sizes are
shown in Additional file 1: Figure S2. Accession numbers of the HBV sequences are supplied in Additional file 1: Table S1.
Discussion

Phylogenetic trees are efficient representations of the 
genetic relationship of biological sequences, although a
phylogenetic network is more informative in applications
involving reticulate relationships, such as those due to
recombinant sequences [21]. Unfortunately, the currently
available methods for reconstructing phylogenetic networks
from genetic data containing recombinant sequences have
very high error rates in terms of identifying the correct phylogeny
[22]. In contrast, many tree-building methods have a
high probability for reconstructing the correct phylogeny 
for sequences without recombination [23]. Phylogenies of 
aligned short pieces of sequences are rarely affected by 
recombination when recombination is not extremely fre- 
quent [24]. A consensus of the local phylogenies of short 
sequence fragments, therefore, can be used to represent the 
phylogenetic relationship of the majority of the involved 
HBV sequences. 

Inter- and intra-genotype recombination is widely recognized
as a critical factor in HBV evolution. Recombinants
in a sequence pool could lead to inconsistencies among local
phylogenies of different fragments of the aligned sequences
[17]. Recombination has thus posed a challenge to phylogenetic
studies of HBV. In addition, uncertainty regarding
the molecular clock also interferes with the reconstructed 
local phylogenies because, for short sequence fragments, 
mutation accumulation follows a Poisson distribution with 
great variance [25]. Therefore, HBV sequence fragments 
with an extremely small size, for example 250 bp, did not 
help to distinguish genotypes B and C from genotype A in 
this study. Both recombination and the uncertainty con- 
tribute to the inconsistency between local phylogenies. For 
the same reason, it is difficult to fully identify all or most 
recombination events or completely eliminate their impact 
in phylogenetic studies based on the comparison of local 
tree topology. In this study, the phylogenetic relationship 
was reconstructed without explicitly identifying instances 
of recombination events and the reconstructed relationship 
was appropriately supported by local phylogenies at both 
the tree and branch levels. A similar approach may facili- 
tate the reconstruction of reliable tree-like phylogenetic 
relationships of viruses in future studies. 

Classic phylogenetic trees often present phylogenetic 
relationships of aligned full-length sequences. The con- 
sensus phylogenetic relationship in this report, however, 
is different. This consensus phylogenetic relationship 
extracts information from the majority of the sequences. 
A small part of the sequence fragments was automatically
ignored during the phylogeny reconstruction, and
the useful fragments may be located at different positions for
different sequences. Excluded fragments of the same
sequence may have the same or different genetic origins,
but the origins make only minor genetic contributions
to the sequences. In this way, minor ancestors of a
sequence are ignored by the consensus phylogenetic tree.
This method provides a natural way to extract important
phylogenetic information from sequences containing
recombination.

The reliability of the consensus phylogeny was evalu- 
ated by comparing the consensus phylogeny with local 
phylogenies of sequence segments in this study. The 
phylogenies were split into rooted triplets to compare 
the consistency of the triplets during the process. In this 
novel approach, more consistency indicated smaller 
topological differences between the phylogenies and 
better reliability of the consensus phylogeny. This ap- 
proach overcomes an obvious limitation of the classical 
consensus measure. The classical measure of majority 
rule consensus actually showed a split consensus for all 
taxa without considering the number of taxa [26]. In the 
classical method, even a small difference in one or two 
branches was treated as having the same importance as 
a large difference between phylogenies. The evaluations 
in this report implemented an alternative approach in 
which a minor difference is distinguished from large 
differences. These findings provide another view of the
reliability of a consensus phylogenetic tree.

The phylogenetic relationships of HBV genotypes A, B, 
and C that were reconstructed in this study elucidated 
the geographic prevalence of the HBV genotypes and 
their phylogenetic relationship. In China and other East 
Asian countries, HBV carriers often have HBV genotype 
B or C, while most Japanese carriers have HBV genotype 
C. Genotype A is rare in East Asia and is found mostly 
in Western Europe, America, India, and Africa [3]. The 
global prevalence of HBV suggests that genotypes B and 
C have a close phylogenetic relationship. Therefore, 
based on the present findings, the map indicating the 
origin and historical dispersion of the HBV genotypes 
that identifies genotype A as being more closely related 
to genotype B or C appears to be incorrect. In fact, the 
controversial results about the phylogenetic relationships 
among these genotypes reported in previous publications 
[3-13] have caused confusion. Our study sheds light on 
the origin and historical dispersion of HBV by using a 
comprehensive approach to confirm that genotypes B 
and C are closer relatives. 

The effects of recombination were reduced in our
analysis to make the result robust. Our simulation suggested
that the consensus method was superior to the regular
ML method in the presence of recombination. The
simulation also offered a possible explanation for
the difference between our consensus phylogenetic relationship
and Shi et al.'s ML tree of HBV genotypes [16].
However, a limitation of the current study is that this
approach is not capable of identifying historical recombination
events in the HBV genome. Fortunately, several
publications have reported some progress in this field
[14-16,27-31]. The evolutionary history of HBV genome
recombination may be clarified in detail in the future,
although rigorous improvements of analysis tools
are necessary.

Conclusions 

A phylogenetic relationship can be reconstructed from the
majority of the phylogenetic information of sequence segments
without explicitly identifying historical recombination
events. The series of phylogenetic methods proposed and
employed in this study provides an effective approach for
reconstructing reliable phylogenetic relationships for
viruses with possible genetic recombination. With this approach,
HBV genotypes B and C had a closer phylogenetic
relationship than genotypes A and B or A and C.

Methods 

Data preparation 

We retrieved 3281 complete sequences of human HBV 
and one full-length sequence of woolly monkey HBV 
from the GenBank database of the National Center for Biotechnology
Information as available in April 2011 [32]. The full
sequence set comprised 320 genotype A, 387 genotype 
B, 836 genotype C, 383 genotype D, 221 genotype E, 72 
genotype F, 15 genotype G, 19 genotype H, and 1043 
unknown or uncertain genotype sequences. The geno- 
types assigned to the different sequences were obtained 
either directly from the GenBank records or from the 
associated publications. 

All the sequences were screened to exclude entries that
were related to patents, artificial mutants, and identical sequences.
Further, sequences with an unknown or uncertain
genotype or with documented recombination were
removed. The remaining sequences were aligned using the
MUSCLE software with default parameters [33]. Results
of the alignments were checked manually for further 
validation. Gaps (insertions/deletions) and all nonstandard 
nucleotide bases (all characters except A, C, G, T, and -) 
were considered as missing values in further analysis. 
After that, sequences with more than 20% gaps or missing 
data were removed. Positions of sites were identified by 
their relative positions to the traditional hypothetical 
EcoRI site in the full-genome alignments. 
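
The missing-data rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the alignment is assumed to be a dictionary of equal-length strings, and the names `missing_fraction` and `drop_sparse` are hypothetical.

```python
VALID = set("ACGT")

def missing_fraction(aligned_seq):
    """Fraction of alignment columns that are gaps or nonstandard bases."""
    seq = aligned_seq.upper()
    missing = sum(1 for base in seq if base not in VALID)
    return missing / len(seq) if seq else 1.0

def drop_sparse(alignment, max_missing=0.20):
    """Keep sequences whose missing fraction does not exceed max_missing."""
    return {name: seq for name, seq in alignment.items()
            if missing_fraction(seq) <= max_missing}

toy = {
    "a": "ACGTACGTAC",   # no missing data
    "b": "ACNN--GTAC",   # 4 of 10 columns missing, removed
    "c": "AC-TACGTAN",   # 2 of 10 columns missing, kept
}
print(sorted(drop_sparse(toy)))  # ['a', 'c']
```

Gaps and ambiguity codes are counted together here because the text treats both as missing values.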

To achieve a fair and representative presentation for all
the genotypes, we applied a multi-step procedure to remove
redundant sequences from the initial sequence set. In
the first step, we sequentially removed sequences with
high similarity to any others until all remaining sequences
had a pairwise difference larger than or equal to 2.5%.
After the initial cleaning, the sequence pool had 379 full- 
length HBV sequences (including 38 genotype A, 82 geno- 
type B, 138 genotype C, 77 genotype D, 32 genotype E, 9 
genotype F, 2 genotype G, and 3 genotype H). 
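
One way to picture the redundancy filter is a greedy pass over the alignment. The sketch below is an assumption-laden illustration: the text does not state the order in which sequences were removed or how missing sites were handled, so the greedy order, the uncorrected p-distance, and the names `p_distance` and `thin_by_distance` are all choices made for the example.

```python
def p_distance(a, b):
    """Proportion of differing sites, skipping sites missing in either sequence."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return 0.0  # no comparable sites: treated as identical (assumption)
    return sum(x != y for x, y in pairs) / len(pairs)

def thin_by_distance(alignment, min_diff=0.025):
    """Greedily keep sequences that differ from every kept one by >= min_diff."""
    kept = {}
    for name, seq in alignment.items():
        if all(p_distance(seq, other) >= min_diff for other in kept.values()):
            kept[name] = seq
    return kept

toy = {
    "s1": "A" * 40,
    "s2": "A" * 38 + "CC",   # differs from s1 at 2 of 40 sites (5%)
    "s3": "A" * 40,          # identical to s1, dropped
}
print(list(thin_by_distance(toy)))  # ['s1', 's2']
```

A greedy pass depends on input order, so a different ordering may retain a different but equally valid subset.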



From the filtered sequences, 30 sequences were randomly
drawn for each of genotypes A, B, C, D, and E.
Genotypes F, G, and H were not included in further
analysis because the purpose of the present study was to
elucidate the phylogenetic relationship of genotypes A, B,
and C. Furthermore, including the limited sequences of
genotypes F, G, and H (9 genotype F, 2 genotype G, and 3
genotype H) in the analysis may produce problematic
results due to the unequal number of sequences
for each genotype. The full-length HBV sequence of woolly
monkey was used as an ancestral reference
(outgroup) in this study [34]. This woolly monkey HBV
sequence and the randomly selected human HBV sequences
were combined and aligned by MUSCLE
with default parameter settings. To improve the data quality
of the aligned sequences, Gblocks was used to remove
aligned columns with more than half gaps or with
low data quality [35,36]. In total, 105 columns (3.2%) were
removed in the process. The working dataset therefore
included 151 full-length sequences of HBV for further
phylogenetic investigation.

Constructing a consensus phylogenetic relationship 

A sliding window approach was used in which an analyzing
window moves along the aligned HBV sequences
with the same step length (10 bp) but a different window
size in different runs. The sliding window procedure is similar
to that of a previous publication on recombination
detection [13]. Analysis of the results from different runs
with different window sizes (250 bp, 500 bp, 750 bp,
1000 bp, 1250 bp, or 1500 bp) could show how differences
in window size impact phylogeny reconstruction. At each
stop of the window movement, local phylogenetic trees of
the aligned sequence fragments were reconstructed by
Ninja software using the neighbor-joining method and the
Kimura 2-parameter model [37]. With the given outgroup,
all the local phylogenetic trees were further split into
primary rooted triplets. From each local phylogenetic tree,
551,300 ($C_{150}^{3}$, the number of combinations of any 3 sequences
from the given set of 150 HBV sequences) primary
rooted triplets were obtained. Because of the circular character
of the HBV genome, the initial start of the HBV sequences
was concatenated at the end of the original
sequences so that each base had equal coverage
by the sliding window.
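
The circular sliding window can be sketched as below. This is an illustrative sketch only: the text does not state how much of the sequence start was appended, so appending the first `size - 1` columns is an assumption, as are the function name `circular_windows` and the toy sequence.

```python
from math import comb

def circular_windows(seq, size, step=10):
    """Yield (start, fragment) for windows sliding over a circular sequence.

    The first size - 1 characters are appended to the end so that
    windows starting near the end wrap around the junction.
    """
    extended = seq + seq[:size - 1]
    for start in range(0, len(seq), step):
        yield start, extended[start:start + size]

genome = "ACGTACGTAGCTAGCTAGGATCCA" * 5   # toy 120-column alignment row
frags = list(circular_windows(genome, size=50, step=10))
print(len(frags), len(frags[-1][1]))      # 12 50

# Number of rooted triplets among 150 ingroup sequences, as quoted in the text.
print(comb(150, 3))                       # 551300
```

With a step that divides the sequence length, every column in this toy example is covered by the same number of windows (size / step = 5).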

The primary rooted phylogenetic triplets of each
window in each run were filtered to remove the minor
triplets that presented two different minor phylogenetic
relationships. It is worth noting here that, for every
combination of 3 human HBV sequences and the root,
there were three possible topologies for each window in
each run and the three topologies were not compatible
with each other. We took only one of the possible
topologies, i.e. the major triplet, for further analysis. The
removed triplets were less common and inconsistent
with the major phylogenetic relationship presented in
the same analyzing window (see Results for further
details, Figure 1). The remaining rooted triplets from all
the analyzing windows in the same run were then
pooled together to reconstruct a consensus tree using
the rooted triplet consensus method [38]. Ewing et al.
(2008) reported that the consensus method based on
rooted triplets outperformed the extended majority rule
consensus strategy [38]. We constructed consensus phylogenetic
relationships of HBV genotypes in different runs
separately using different window sizes.

Evaluating the reliability of the reconstructed 
phylogenetic relationship 

The reliability of the reconstructed phylogenetic relation- 
ship of HBV sequences can be evaluated by comparing 
the consensus phylogenetic relationship with phylogenetic 
trees of genome segments (local phylogenetic trees). Good 
consistency between them would indicate good reliability 
of the consensus phylogeny. In this study, multiple com- 
parisons were conducted to achieve a thorough under- 
standing of the reliability. 

First, the consistency of the reconstructed consensus
phylogeny and local phylogenetic trees was investigated at
the genome-segment level. For each genome segment, local
neighbor-joining trees (involving all 151 taxa) were built
using Ninja software with the aforementioned substitution
model [37]. We then dissected the local neighbor-joining
trees and our consensus tree-like phylogenetic relationship
into rooted triplets. For phylogenies with n taxa (including
an outgroup), the proportion of compatible triplets between
the local tree and consensus tree could be obtained
by $k / C_{n-1}^{3}$, where k is the total number of compatible triplets
and $C_{n-1}^{3}$ is the total number of rooted triplets (n = 151
in this case). The proportions were calculated for all genome
segments and then used as a measure of the agreement
between the reconstructed consensus phylogeny and local
phylogenetic trees.
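
The tree-level measure can be sketched on toy trees. The snippet is a hypothetical illustration, not the implementation used here: trees are assumed to be rooted nested tuples of ingroup names (the outgroup has already been used for rooting and dropped), a triplet counts as compatible only when both trees resolve it identically, and the names `leaves`, `rooted_triplets`, and `consistency` are assumptions.

```python
from itertools import combinations
from math import comb

def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def rooted_triplets(tree):
    """Return resolved triplets as (frozenset({a, b}), c), meaning ab|c."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        child_sets = [leaves(child) for child in node]
        outside = all_leaves - set().union(*child_sets)
        # a and b meet at this node; any c outside the node gives ab|c.
        for left, right in combinations(child_sets, 2):
            for a in left:
                for b in right:
                    for c in outside:
                        found.add((frozenset((a, b)), c))
        for child in node:
            visit(child)

    visit(tree)
    return found

def consistency(local_tree, consensus_tree):
    """Proportion k / C(n-1, 3) of triplets resolved identically in both trees."""
    n_ingroup = len(leaves(consensus_tree))   # n - 1 when n includes the outgroup
    shared = rooted_triplets(local_tree) & rooted_triplets(consensus_tree)
    return len(shared) / comb(n_ingroup, 3)

consensus = ((("B", "C"), "A"), "D")
local = ((("A", "B"), "C"), "D")
print(consistency(local, consensus))  # 0.75
```

Here the two toy trees disagree only on the triplet formed by A, B, and C, so three of the four triplets match.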

Second, the consistency of internal branches (nontrivial
splits) of the consensus phylogenetic tree and local phylogenetic
trees was evaluated by checking how often the
nontrivial splits of the consensus tree were supported by
nontrivial splits of local phylogenetic trees. For any given
internal branch (with m children) of an n-taxa consensus
tree (including an outgroup), the phylogenetic relationship
was dissected into rooted triplets with a total number
$C_{n-m-1}^{1} C_{m}^{2}$ to form a consensus rooted triplet pool. The
probability that a given rooted triplet from the consensus
rooted triplet pool was supported by dissected rooted
triplets of local phylogenetic trees could be estimated by
$y / (j \, C_{n-m-1}^{1} C_{m}^{2})$, where y was the number of dissected
rooted triplets of the local phylogenetic trees which shared
the same phylogenetic relationships with their corresponding
triplets of the consensus tree, and j was the total
number of local neighbor-joining trees determined by the
size of the sliding window and length of the moving step.
The 95% CI of the estimation was obtained by a bootstrapping
method in which local phylogenetic trees were
randomly sampled with replacement to generate an artificial
rooted triplet pool for the aforementioned evaluation.
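
The bootstrap over local trees can be sketched as a percentile bootstrap. This is a minimal sketch under assumptions: each local tree is summarized by the fraction of the branch's consensus triplets it reproduces (the mean of these fractions equals the pooled estimate described above), a percentile interval is used because the text does not name the interval type, and `bootstrap_agreement` is a hypothetical name.

```python
import random
import statistics

def bootstrap_agreement(per_tree_support, n_boot=1000, seed=0):
    """Percentile bootstrap of mean branch agreement over local trees.

    `per_tree_support` lists, for each local tree, the fraction of the
    branch's consensus triplets that the tree reproduces.
    Returns (median, lower, upper) with a 95% percentile interval.
    """
    rng = random.Random(seed)
    n = len(per_tree_support)
    means = sorted(
        statistics.fmean(rng.choices(per_tree_support, k=n))
        for _ in range(n_boot)
    )
    lower = means[n_boot * 25 // 1000]
    upper = means[n_boot * 975 // 1000 - 1]
    return statistics.median(means), lower, upper

# Toy per-tree support values (not data from this study).
support = [0.7, 0.8, 0.75, 0.9, 0.6] * 20
median, lower, upper = bootstrap_agreement(support)
print(lower <= median <= upper)  # True
```

Resampling whole local trees, rather than individual triplets, keeps the triplets from one window together, which matches the description of sampling local phylogenetic trees with replacement.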

Performance demonstration in the presence of 
recombination 

Synthetic data were generated by introducing simulated
genotype A/C recombinants into the raw data set that was
used for the aforementioned investigation of HBV phylogeny.
For a pair of sequences, one from each of the two genotypes,
we set the recombination probability p. The expected
frequency of recombinants in the sequence pool of genotypes
A, C, and A/C recombinants could be estimated as
$f = 1 - (1 - p)^{30}$ because 30 sequences of each genotype
were included in the raw data set. We considered all possible
pairs of the involved sequences of genotypes A and
C to simulate the occurrence of recombination between
the two genotypes. When a recombination occurred between
a pair of sequences with probability p, the location of
the recombinant fragment was randomly chosen on the
HBV genome, and the length of the recombinant fragment
was determined by the empirical length distribution of
recombinants from Yang et al.'s study [15]. Because the HBV
genome is a circular molecule, we allowed the recombinant
fragment to span the junction of the sequence end and start.
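
The simulation scheme can be sketched in a few lines. The code is a hypothetical illustration and not the simulator used in this study: which sequence of a pair acts as recipient is not stated in the text and is assumed here to be the genotype A sequence, `fragment_lengths` stands in for the empirical length distribution from [15], and all function names are assumptions.

```python
import random

def expected_recombinant_frequency(p, partners=30):
    """f = 1 - (1 - p)**partners: chance of at least one event among the pairings."""
    return 1.0 - (1.0 - p) ** partners

def recombine_circular(recipient, donor, start, length):
    """Copy `length` columns from donor into recipient, wrapping at the end."""
    size = len(recipient)
    out = list(recipient)
    for offset in range(length):
        pos = (start + offset) % size
        out[pos] = donor[pos]
    return "".join(out)

def simulate(seqs_a, seqs_c, p, fragment_lengths, rng):
    """Return recombinants drawn over all A x C pairs with probability p."""
    recombinants = []
    for a in seqs_a:
        for c in seqs_c:
            if rng.random() < p:
                start = rng.randrange(len(a))
                length = rng.choice(fragment_lengths)
                recombinants.append(recombine_circular(a, c, start, length))
    return recombinants

# A fragment of 4 columns starting at column 8 wraps around the junction.
print(recombine_circular("A" * 10, "C" * 10, start=8, length=4))  # CCAAAAAACC

# Illustrative value only: p = 0.005 gives f of about 0.14.
print(round(expected_recombinant_frequency(0.005), 2))  # 0.14
```

The modulo index in `recombine_circular` is what lets a fragment cross the end of the linearized alignment, reflecting the circular genome.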

The phylogenetic relationship of the synthetic data was
reconstructed using the ML method. Before the reconstruction,
jModelTest2 was executed to choose the best-fit
model from the 88 candidate models [39]. Since the GTR +
I + G model was selected as the best-fit model, an ML tree
was built using the ML method implemented in the PALM
package [40]. The same synthetic data were also analyzed
by our consensus method to produce a consensus tree.
Using different recombination probabilities p, we performed
the data simulation and phylogeny reconstruction
multiple times to achieve a thorough evaluation.

Additional file 



Additional file 1:
Figure S1. Consistency of the consensus phylogenetic tree and local phylogenies along the HBV genome for window
sizes 1000 bp, 1250 bp, and 1500 bp. The consistency is measured as the percentage of agreement between local
phylogenies and the corresponding consensus tree. The percentage is shown on the y-axis. The x-axis represents
coordinates of local phylogenies along the HBV genome. The dashed line indicates the 50% agreement.
Figure S2. Reliability of internal branches of the consensus phylogenetic tree. Reliability of the internal branches
(nontrivial splits) of the consensus phylogenetic tree is evaluated from a rooted triplet perspective. The values on
the branches are the medians of 1000 bootstrap replicates; confidence intervals are not shown. Figures S2.1-S2.5
show results for window sizes 250 bp, 750 bp, 1000 bp, 1250 bp, and 1500 bp, respectively. Accession numbers
of the HBV sequences are listed in Table S1.
Figure S3. ML tree of a synthetic HBV dataset. With the simulated recombinants of genotypes A and C, the ML method
failed to reconstruct the correct phylogeny for the synthetic data. Genotypes A and C formed a false cluster. Details
of the simulated recombinants are presented in Table S2.
Figure S4. Consensus tree of a synthetic dataset. Using synthetic data with simulated recombinants, our consensus
method successfully restored the original phylogenetic relationship of HBV genotypes, in which genotypes
B and C formed the correct cluster. This figure shows the consensus phylogeny for a sliding window size of 500 bp.
Details of the simulated recombinants are presented in Table S2.
Table S1. Accession numbers of HBV sequences involved in phylogenetic trees. All these sequences
were retrieved from the GenBank of the National Center for Biotechnology Information.
Table S2. Details of simulated recombinants in a synthetic dataset.